Exploration server on OCR

Warning: this site is under development.
Warning: this site was generated automatically from raw corpora.
The information has therefore not been validated.

Détection et correction automatique d'entités nommées dans des corpus OCRisés

Internal identifier: 000067 (Main/Exploration); previous: 000066; next: 000068

Authors: Benoît Sagot [France]; Kata Gábor [France]

RBID : Hal:hal-01022378

Abstract

Correcting textual data obtained by optical character recognition (OCR) to reach editorial quality is an expensive task, as it still requires human intervention. The coverage of statistical models for automated error detection and correction is inherently limited to errors that pertain to general language. However, a large number of errors occur in domain-specific named entities, especially in data such as patent corpora or legal texts. In this paper, we propose a rule-based architecture for the identification and correction of a wide range of named entities (proper names not included). We show that our architecture achieves good recall and excellent correction accuracy on error types that are difficult to address with statistical approaches.
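
To make the idea of rule-based correction concrete, here is a minimal sketch in Python. It is not the authors' architecture, which this record does not describe. It assumes one hypothetical entity type (dates written dd/mm/yyyy) and a hypothetical table of letter-for-digit OCR confusions; the names CONFUSIONS, DATE_RULE and correct_dates are illustrative only.

```python
import re

# Hypothetical confusion table (an assumption, not taken from the paper):
# letters that OCR may output in place of visually similar digits.
CONFUSIONS = str.maketrans(
    {"O": "0", "o": "0", "l": "1", "I": "1", "S": "5", "B": "8"}
)

# One "digit slot" accepts a real digit or one of the confusable letters.
SLOT = r"[0-9OolISB]"

# Rule for one entity type: a date-like token of the form dd/mm/yyyy.
DATE_RULE = re.compile(rf"\b{SLOT}{{2}}/{SLOT}{{2}}/{SLOT}{{4}}\b")


def correct_dates(text):
    """Rewrite confusable letters as digits, but only inside date-like tokens."""

    def fix(match):
        return match.group(0).translate(CONFUSIONS)

    return DATE_RULE.sub(fix, text)


print(correct_dates("filed on l2/O3/2OI4 by SOS"))
```

Because the substitution is applied only inside spans matched by the rule, ordinary words such as "SOS" are left untouched, which is the kind of context a general-language statistical model does not exploit.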

The document in XML format

<record>
<TEI>
<teiHeader>
<fileDesc>
<titleStmt>
<title xml:lang="fr">Détection et correction automatique d'entités nommées dans des corpus OCRisés</title>
<author>
<name sortKey="Sagot, Benoit" sort="Sagot, Benoit" uniqKey="Sagot B" first="Benoît" last="Sagot">Benoît Sagot</name>
<affiliation wicri:level="1">
<hal:affiliation type="researchteam" xml:id="struct-54505" status="OLD">
<idno type="RNSR">200818336A</idno>
<orgName>Analyse Linguistique Profonde à Grande Echelle ; Large-scale deep linguistic processing</orgName>
<orgName type="acronym">ALPAGE</orgName>
<date type="end">2016-01-31</date>
<desc>
<address>
<addrLine>Université Paris Diderot, Bât. Olympe de Gouges, case postale 7003, 75205 Paris cedex 13 - INRIA Rocquencour</addrLine>
<country key="FR"></country>
</address>
<ref type="url">http://www.inria.fr/equipes/alpage</ref>
</desc>
<listRelation>
<relation active="#struct-86790" type="direct"></relation>
<relation active="#struct-300009" type="indirect"></relation>
<relation active="#struct-300301" type="direct"></relation>
</listRelation>
<tutelles>
<tutelle active="#struct-86790" type="direct">
<org type="laboratory" xml:id="struct-86790" status="VALID">
<idno type="RNSR">196718247G</idno>
<orgName>INRIA Paris-Rocquencourt</orgName>
<desc>
<address>
<addrLine>INRIA Rocquencourt : Domaine de Voluceau, Rocquencourt B.P. 105 78153 le Chesnay Cedex / INRIA Paris - 23 avenue d'Italie 75013 Paris</addrLine>
<country key="FR"></country>
</address>
<ref type="url">http://www.inria.fr/centre/paris-rocquencourt</ref>
</desc>
<listRelation>
<relation active="#struct-300009" type="direct"></relation>
</listRelation>
</org>
</tutelle>
<tutelle active="#struct-300009" type="indirect">
<org type="institution" xml:id="struct-300009" status="VALID">
<orgName>Institut National de Recherche en Informatique et en Automatique</orgName>
<orgName type="acronym">Inria</orgName>
<desc>
<address>
<addrLine>Domaine de Voluceau, Rocquencourt - BP 105, 78153 Le Chesnay Cedex</addrLine>
<country key="FR"></country>
</address>
<ref type="url">http://www.inria.fr/en/</ref>
</desc>
</org>
</tutelle>
<tutelle active="#struct-300301" type="direct">
<org type="institution" xml:id="struct-300301" status="VALID">
<orgName>Université Paris Diderot - Paris 7</orgName>
<orgName type="acronym">UP7</orgName>
<desc>
<address>
<addrLine>5 rue Thomas-Mann - 75205 Paris cedex 13</addrLine>
<country key="FR"></country>
</address>
<ref type="url">http://www.univ-paris-diderot.fr</ref>
</desc>
</org>
</tutelle>
</tutelles>
</hal:affiliation>
<country>France</country>
</affiliation>
</author>
<author>
<name sortKey="Gabor, Kata" sort="Gabor, Kata" uniqKey="Gabor K" first="Kata" last="Gábor">Kata Gábor</name>
<affiliation wicri:level="1">
<hal:affiliation type="researchteam" xml:id="struct-54505" status="OLD">
<idno type="RNSR">200818336A</idno>
<orgName>Analyse Linguistique Profonde à Grande Echelle ; Large-scale deep linguistic processing</orgName>
<orgName type="acronym">ALPAGE</orgName>
<date type="end">2016-01-31</date>
<desc>
<address>
<addrLine>Université Paris Diderot, Bât. Olympe de Gouges, case postale 7003, 75205 Paris cedex 13 - INRIA Rocquencour</addrLine>
<country key="FR"></country>
</address>
<ref type="url">http://www.inria.fr/equipes/alpage</ref>
</desc>
<listRelation>
<relation active="#struct-86790" type="direct"></relation>
<relation active="#struct-300009" type="indirect"></relation>
<relation active="#struct-300301" type="direct"></relation>
</listRelation>
<tutelles>
<tutelle active="#struct-86790" type="direct">
<org type="laboratory" xml:id="struct-86790" status="VALID">
<idno type="RNSR">196718247G</idno>
<orgName>INRIA Paris-Rocquencourt</orgName>
<desc>
<address>
<addrLine>INRIA Rocquencourt : Domaine de Voluceau, Rocquencourt B.P. 105 78153 le Chesnay Cedex / INRIA Paris - 23 avenue d'Italie 75013 Paris</addrLine>
<country key="FR"></country>
</address>
<ref type="url">http://www.inria.fr/centre/paris-rocquencourt</ref>
</desc>
<listRelation>
<relation active="#struct-300009" type="direct"></relation>
</listRelation>
</org>
</tutelle>
<tutelle active="#struct-300009" type="indirect">
<org type="institution" xml:id="struct-300009" status="VALID">
<orgName>Institut National de Recherche en Informatique et en Automatique</orgName>
<orgName type="acronym">Inria</orgName>
<desc>
<address>
<addrLine>Domaine de Voluceau, Rocquencourt - BP 105, 78153 Le Chesnay Cedex</addrLine>
<country key="FR"></country>
</address>
<ref type="url">http://www.inria.fr/en/</ref>
</desc>
</org>
</tutelle>
<tutelle active="#struct-300301" type="direct">
<org type="institution" xml:id="struct-300301" status="VALID">
<orgName>Université Paris Diderot - Paris 7</orgName>
<orgName type="acronym">UP7</orgName>
<desc>
<address>
<addrLine>5 rue Thomas-Mann - 75205 Paris cedex 13</addrLine>
<country key="FR"></country>
</address>
<ref type="url">http://www.univ-paris-diderot.fr</ref>
</desc>
</org>
</tutelle>
</tutelles>
</hal:affiliation>
<country>France</country>
</affiliation>
</author>
</titleStmt>
<publicationStmt>
<idno type="wicri:source">HAL</idno>
<idno type="RBID">Hal:hal-01022378</idno>
<idno type="halId">hal-01022378</idno>
<idno type="halUri">https://hal.inria.fr/hal-01022378</idno>
<idno type="url">https://hal.inria.fr/hal-01022378</idno>
<date when="2014-07-01">2014-07-01</date>
<idno type="wicri:Area/Hal/Corpus">000145</idno>
<idno type="wicri:Area/Hal/Curation">000145</idno>
<idno type="wicri:Area/Hal/Checkpoint">000027</idno>
<idno type="wicri:Area/Main/Merge">000067</idno>
<idno type="wicri:Area/Main/Curation">000067</idno>
<idno type="wicri:Area/Main/Exploration">000067</idno>
</publicationStmt>
<sourceDesc>
<biblStruct>
<analytic>
<title xml:lang="fr">Détection et correction automatique d'entités nommées dans des corpus OCRisés</title>
<author>
<name sortKey="Sagot, Benoit" sort="Sagot, Benoit" uniqKey="Sagot B" first="Benoît" last="Sagot">Benoît Sagot</name>
<affiliation wicri:level="1">
<hal:affiliation type="researchteam" xml:id="struct-54505" status="OLD">
<idno type="RNSR">200818336A</idno>
<orgName>Analyse Linguistique Profonde à Grande Echelle ; Large-scale deep linguistic processing</orgName>
<orgName type="acronym">ALPAGE</orgName>
<date type="end">2016-01-31</date>
<desc>
<address>
<addrLine>Université Paris Diderot, Bât. Olympe de Gouges, case postale 7003, 75205 Paris cedex 13 - INRIA Rocquencour</addrLine>
<country key="FR"></country>
</address>
<ref type="url">http://www.inria.fr/equipes/alpage</ref>
</desc>
<listRelation>
<relation active="#struct-86790" type="direct"></relation>
<relation active="#struct-300009" type="indirect"></relation>
<relation active="#struct-300301" type="direct"></relation>
</listRelation>
<tutelles>
<tutelle active="#struct-86790" type="direct">
<org type="laboratory" xml:id="struct-86790" status="VALID">
<idno type="RNSR">196718247G</idno>
<orgName>INRIA Paris-Rocquencourt</orgName>
<desc>
<address>
<addrLine>INRIA Rocquencourt : Domaine de Voluceau, Rocquencourt B.P. 105 78153 le Chesnay Cedex / INRIA Paris - 23 avenue d'Italie 75013 Paris</addrLine>
<country key="FR"></country>
</address>
<ref type="url">http://www.inria.fr/centre/paris-rocquencourt</ref>
</desc>
<listRelation>
<relation active="#struct-300009" type="direct"></relation>
</listRelation>
</org>
</tutelle>
<tutelle active="#struct-300009" type="indirect">
<org type="institution" xml:id="struct-300009" status="VALID">
<orgName>Institut National de Recherche en Informatique et en Automatique</orgName>
<orgName type="acronym">Inria</orgName>
<desc>
<address>
<addrLine>Domaine de Voluceau, Rocquencourt - BP 105, 78153 Le Chesnay Cedex</addrLine>
<country key="FR"></country>
</address>
<ref type="url">http://www.inria.fr/en/</ref>
</desc>
</org>
</tutelle>
<tutelle active="#struct-300301" type="direct">
<org type="institution" xml:id="struct-300301" status="VALID">
<orgName>Université Paris Diderot - Paris 7</orgName>
<orgName type="acronym">UP7</orgName>
<desc>
<address>
<addrLine>5 rue Thomas-Mann - 75205 Paris cedex 13</addrLine>
<country key="FR"></country>
</address>
<ref type="url">http://www.univ-paris-diderot.fr</ref>
</desc>
</org>
</tutelle>
</tutelles>
</hal:affiliation>
<country>France</country>
</affiliation>
</author>
<author>
<name sortKey="Gabor, Kata" sort="Gabor, Kata" uniqKey="Gabor K" first="Kata" last="Gábor">Kata Gábor</name>
<affiliation wicri:level="1">
<hal:affiliation type="researchteam" xml:id="struct-54505" status="OLD">
<idno type="RNSR">200818336A</idno>
<orgName>Analyse Linguistique Profonde à Grande Echelle ; Large-scale deep linguistic processing</orgName>
<orgName type="acronym">ALPAGE</orgName>
<date type="end">2016-01-31</date>
<desc>
<address>
<addrLine>Université Paris Diderot, Bât. Olympe de Gouges, case postale 7003, 75205 Paris cedex 13 - INRIA Rocquencour</addrLine>
<country key="FR"></country>
</address>
<ref type="url">http://www.inria.fr/equipes/alpage</ref>
</desc>
<listRelation>
<relation active="#struct-86790" type="direct"></relation>
<relation active="#struct-300009" type="indirect"></relation>
<relation active="#struct-300301" type="direct"></relation>
</listRelation>
<tutelles>
<tutelle active="#struct-86790" type="direct">
<org type="laboratory" xml:id="struct-86790" status="VALID">
<idno type="RNSR">196718247G</idno>
<orgName>INRIA Paris-Rocquencourt</orgName>
<desc>
<address>
<addrLine>INRIA Rocquencourt : Domaine de Voluceau, Rocquencourt B.P. 105 78153 le Chesnay Cedex / INRIA Paris - 23 avenue d'Italie 75013 Paris</addrLine>
<country key="FR"></country>
</address>
<ref type="url">http://www.inria.fr/centre/paris-rocquencourt</ref>
</desc>
<listRelation>
<relation active="#struct-300009" type="direct"></relation>
</listRelation>
</org>
</tutelle>
<tutelle active="#struct-300009" type="indirect">
<org type="institution" xml:id="struct-300009" status="VALID">
<orgName>Institut National de Recherche en Informatique et en Automatique</orgName>
<orgName type="acronym">Inria</orgName>
<desc>
<address>
<addrLine>Domaine de Voluceau, Rocquencourt - BP 105, 78153 Le Chesnay Cedex</addrLine>
<country key="FR"></country>
</address>
<ref type="url">http://www.inria.fr/en/</ref>
</desc>
</org>
</tutelle>
<tutelle active="#struct-300301" type="direct">
<org type="institution" xml:id="struct-300301" status="VALID">
<orgName>Université Paris Diderot - Paris 7</orgName>
<orgName type="acronym">UP7</orgName>
<desc>
<address>
<addrLine>5 rue Thomas-Mann - 75205 Paris cedex 13</addrLine>
<country key="FR"></country>
</address>
<ref type="url">http://www.univ-paris-diderot.fr</ref>
</desc>
</org>
</tutelle>
</tutelles>
</hal:affiliation>
<country>France</country>
</affiliation>
</author>
</analytic>
</biblStruct>
</sourceDesc>
</fileDesc>
<profileDesc>
<textClass></textClass>
</profileDesc>
</teiHeader>
<front>
<div type="abstract" xml:lang="en">Correcting textual data obtained by optical character recognition (OCR) to reach editorial quality is an expensive task, as it still requires human intervention. The coverage of statistical models for automated error detection and correction is inherently limited to errors that pertain to general language. However, a large number of errors occur in domain-specific named entities, especially in data such as patent corpora or legal texts. In this paper, we propose a rule-based architecture for the identification and correction of a wide range of named entities (proper names not included). We show that our architecture achieves good recall and excellent correction accuracy on error types that are difficult to address with statistical approaches.</div>
</front>
</TEI>
<affiliations>
<list>
<country>
<li>France</li>
</country>
</list>
<tree>
<country name="France">
<noRegion>
<name sortKey="Sagot, Benoit" sort="Sagot, Benoit" uniqKey="Sagot B" first="Benoît" last="Sagot">Benoît Sagot</name>
</noRegion>
<name sortKey="Gabor, Kata" sort="Gabor, Kata" uniqKey="Gabor K" first="Kata" last="Gábor">Kata Gábor</name>
</country>
</tree>
</affiliations>
</record>
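
As an illustration of how the record above could be read programmatically, here is a minimal Python sketch using the standard library. It works on a simplified fragment rather than the full record: the full record uses the wicri: and hal: prefixes without declaring them, so a namespace-aware parser would likely reject it as-is. The function name summarize and the choice of fields are assumptions made for this example.

```python
import xml.etree.ElementTree as ET

# Simplified fragment of the record shown above, reduced to elements
# that carry no undeclared namespace prefix (an assumption made so that
# a standard parser accepts it).
RECORD = """
<record>
  <TEI>
    <teiHeader>
      <fileDesc>
        <titleStmt>
          <title xml:lang="fr">Détection et correction automatique d'entités nommées dans des corpus OCRisés</title>
          <author>
            <name sortKey="Sagot, Benoit" first="Benoît" last="Sagot">Benoît Sagot</name>
          </author>
          <author>
            <name sortKey="Gabor, Kata" first="Kata" last="Gábor">Kata Gábor</name>
          </author>
        </titleStmt>
        <publicationStmt>
          <idno type="halId">hal-01022378</idno>
          <date when="2014-07-01">2014-07-01</date>
        </publicationStmt>
      </fileDesc>
    </teiHeader>
  </TEI>
</record>
"""


def summarize(record_xml):
    """Return title, authors, HAL identifier and date from a record string."""
    root = ET.fromstring(record_xml)
    title_stmt = root.find(".//titleStmt")
    return {
        "title": title_stmt.findtext("title"),
        "authors": [a.findtext("name") for a in title_stmt.findall("author")],
        "hal_id": root.findtext(".//idno[@type='halId']"),
        "date": root.find(".//publicationStmt/date").get("when"),
    }


summary = summarize(RECORD)
print(summary["authors"], summary["hal_id"], summary["date"])
```

To process the full record the same way, one option would be to add declarations for the wicri and hal prefixes on the root element before parsing; the namespace URIs to use are not given in this page.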

To manipulate this document under Unix (Dilib)

EXPLOR_STEP=$WICRI_ROOT/Ticri/CIDE/explor/OcrV1/Data/Main/Exploration
HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 000067 | SxmlIndent | more

Or

HfdSelect -h $EXPLOR_AREA/Data/Main/Exploration/biblio.hfd -nk 000067 | SxmlIndent | more

To link to this page within the Wicri network

{{Explor lien
   |wiki=    Ticri/CIDE
   |area=    OcrV1
   |flux=    Main
   |étape=   Exploration
   |type=    RBID
   |clé=     Hal:hal-01022378
   |texte=   Détection et correction automatique d'entités nommées dans des corpus OCRisés
}}

Wicri

This area was generated with Dilib version V0.6.32.
Data generation: Sat Nov 11 16:53:45 2017. Site generation: Mon Mar 11 23:15:16 2024